Deep Learning for Nasopharyngeal Carcinoma Segmentation in Magnetic Resonance Imaging: A Systematic Review and Meta-Analysis

Nasopharyngeal carcinoma (NPC) is a significant health challenge that is particularly prevalent in Southeast Asia and North Africa. Magnetic resonance imaging (MRI) is the preferred diagnostic tool for NPC due to its superior soft tissue contrast. The accurate segmentation of NPC in MRI is crucial for effective treatment planning and prognosis. We conducted a search across PubMed, Embase, and Web of Science from inception up to 20 March 2024, adhering to the PRISMA 2020 guidelines. Eligibility criteria focused on studies utilizing deep learning (DL) for NPC segmentation in adults via MRI. Data extraction and meta-analysis were conducted to evaluate the performance of DL models, primarily measured by Dice scores. We assessed methodological quality using the CLAIM and QUADAS-2 tools, and statistical analysis was performed using random effects models. The analysis incorporated 17 studies, demonstrating a pooled Dice score of 78% for DL models (95% confidence interval: 74% to 83%), indicating a moderate to high segmentation accuracy by DL models. Significant heterogeneity and publication bias were observed among the included studies. Our findings reveal that DL models, particularly convolutional neural networks, offer moderately accurate NPC segmentation in MRI. This advancement holds the potential for enhancing NPC management, and further research is needed before integration into clinical practice.


Introduction
Nasopharyngeal carcinoma (NPC) is a distinct head and neck cancer subtype originating in the nasopharynx, the upper region of the throat posterior to the nasal cavity [1]. Despite its rarity on a global scale, NPC exhibits a higher incidence in specific geographic regions, such as Southeast Asia and North Africa, likely attributable to a combination of genetic, environmental, and Epstein-Barr virus-related factors [2,3]. The early detection and accurate diagnosis of NPC are paramount for optimal treatment planning and improving patient prognosis [4]. However, the complex anatomy of the nasopharynx and the variability in clinical presentation make early detection and accurate diagnosis of NPC challenging.
In this context, magnetic resonance imaging (MRI) is the preferred imaging modality for the diagnosis, staging, and treatment planning of NPC due to its superior soft tissue contrast resolution compared to other imaging techniques, such as computed tomography (CT). MRI's excellent contrast resolution allows for an accurate delineation of the primary [...]
[...] OR magnetic resonance imaging OR MR) AND (segmentation OR contouring OR delineation) AND (deep learning OR convolutional neural networks OR CNN)) and are further detailed in Table S3. The process included title and abstract screening supplemented by manual searches to capture pertinent studies comprehensively. Any disagreements in study selection were resolved by consulting a third expert. We only included studies that applied DL for NPC segmentation in adult patients using MRI scans. Exclusions were made for non-MRI studies, retracted conference papers, supplementary materials, studies not addressing the research question directly, or those lacking necessary data for meta-analysis (e.g., missing standard deviation of Dice scores).

Data Extraction and Management
T-WW and C-KW collected key data from the chosen studies, including the study design, patient counts, and the number of series in training and testing sets. They also reviewed the sources of data, the validation techniques used for the models, and the standards for establishing reference values and indicators for ground truths. The documentation included MRI image specifics like magnetic field strength, sequences, and the manufacturer and model of the MRI equipment. The evaluation of the algorithms focused on their dimensions and types. This was accompanied by a detailed review of preprocessing methods, covering normalization, resolution resampling, data augmentation, and image cropping techniques. An extensive evaluation of the Sørensen-Dice coefficient was performed, highlighting its crucial role in assessing segmentation accuracy in these studies.
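For readers less familiar with the metric, the Sørensen-Dice coefficient between a predicted mask A and a reference mask B is 2|A∩B| / (|A| + |B|). The following is a minimal sketch of that definition, assuming flattened binary masks; the function name `dice_coefficient`, the toy masks, and the convention of returning 1.0 for two empty masks are illustrative assumptions and are not taken from any of the included studies.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], ref: Sequence[int]) -> float:
    """Sørensen-Dice overlap between two flattened binary masks (1 = tumor voxel)."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled as tumor in both masks.
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    # Total tumor voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


# Toy example: 3 predicted voxels, 4 reference voxels, 3 shared.
pred = [0, 1, 1, 1, 0, 0, 0, 0]
ref = [0, 1, 1, 1, 1, 0, 0, 0]
print(round(dice_coefficient(pred, ref), 3))  # 2*3 / (3+4) = 0.857
```

A value of 1 means perfect overlap and 0 means no overlap, which is why the studies' Dice scores can be reported either as fractions or as percentages.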

Methodological Quality Appraisal
Two established tools were used to evaluate the methodological quality of the studies: the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) and the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) [17,18]. T-WW and C-KW conducted these assessments independently to minimize bias. Disagreements were resolved by consulting senior researchers and ensuring a consensus-based, rigorous quality assessment. This approach underscores the commitment to methodological precision and consensus in evaluating study quality.

Statistical Analysis
Two meta-analyses assessed the Dice scores reported by the studies. The first analysis selected the highest-performing algorithm when multiple outcomes were reported per study or when different studies used the same validation dataset. Median and interquartile ranges were converted to mean and standard deviation using established formulas [19,20]. A random effects model with restricted maximum likelihood was applied to accommodate study population heterogeneity [21], visualized through forest plots and assessed via sensitivity analysis (leave-one-out method) and subgroup analyses on variables like publication status [22]. The Q test quantified heterogeneity across studies, setting statistical significance at a p-value of <0.05. Heterogeneity levels were categorized by I² values as trivial (0-25%), minimal (26-50%), moderate (51-75%), and pronounced (76-100%) [23]. To assess publication bias, Egger's method for funnel plot asymmetry was employed, utilizing Stata/SE 18.0 for Mac [24].
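The pooling steps described above can be sketched as follows. This is only an illustration under stated assumptions: it uses the closed-form DerSimonian-Laird estimator of the between-study variance rather than the restricted maximum likelihood estimator used in this review (which requires iterative fitting), it computes I² from Q as (Q − df)/Q, and the quartile conversion shown is one common large-sample approximation that may differ from the formulas in [19,20]. All function names and the toy study data are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def quartiles_to_mean_sd(q1: float, median: float, q3: float) -> Tuple[float, float]:
    """Approximate mean and SD from quartiles (assumed large-sample formula)."""
    return (q1 + median + q3) / 3.0, (q3 - q1) / 1.35


def pool_random_effects(effects: List[float], variances: List[float]) -> Dict[str, float]:
    """Inverse-variance pooling with a DerSimonian-Laird between-study variance."""
    k = len(effects)
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    # Random-effects weights add tau2 to each study's sampling variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {"pooled": pooled, "ci_low": pooled - 1.96 * se,
            "ci_high": pooled + 1.96 * se, "Q": q, "tau2": tau2, "I2": i2}


def heterogeneity_label(i2: float) -> str:
    """Map I² (in percent) to the categories used in this review."""
    if i2 <= 25:
        return "trivial"
    if i2 <= 50:
        return "minimal"
    if i2 <= 75:
        return "moderate"
    return "pronounced"


# Hypothetical studies: (mean Dice, SD, number of test cases).
studies = [(0.70, 0.10, 50), (0.80, 0.08, 100), (0.84, 0.05, 200)]
effects = [m for m, _, _ in studies]
variances = [sd ** 2 / n for _, sd, n in studies]  # squared standard error of the mean
result = pool_random_effects(effects, variances)
print(result, heterogeneity_label(result["I2"]))
```

Because the estimator differs from the one used in the review, numbers produced by this sketch should not be expected to match the reported pooled estimates.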
The second meta-analysis explored DL algorithm performance variability across validation sets, addressing dataset reuse by comparing bi-level and tri-level random effects models, the latter clustering by dataset to mitigate mixed effects from validation reuse. The variance was assessed across three levels (datasets, repeated analyses, and study samples) using analysis of variance and Cheung's formula [25]. Meta-regression [26] incorporated moderators like dataset splitting (train/test vs. cross-validation), the validation method (internal validation vs. external validation), MRI sequence (single vs. multiple), algorithm type (U-net, U-net variants vs. CNN), training size, and preprocessing techniques (intensity normalization, resolution adjustment, image augmentation, and image cropping).
Statistical analysis was conducted with the metafor package in R, considering p < 0.05 as significant.
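The three-level variance partition can be illustrated with a short sketch. It assumes the variance components have already been estimated by a fitted three-level model (for instance with `rma.mv` in metafor) and applies the usual decomposition in which each component is divided by the sum of the two heterogeneity variances and a "typical" sampling variance. The function names and the input values below are hypothetical and are not the estimates from this review.

```python
from typing import Dict, List


def typical_sampling_variance(variances: List[float]) -> float:
    """Typical within-study sampling variance (Higgins-Thompson form)."""
    k = len(variances)
    w = [1.0 / v for v in variances]
    return (k - 1) * sum(w) / (sum(w) ** 2 - sum(wi ** 2 for wi in w))


def variance_shares(variances: List[float],
                    tau2_within: float,
                    tau2_between: float) -> Dict[str, float]:
    """Percentage of total variance at each level of a three-level model."""
    v_typ = typical_sampling_variance(variances)
    total = v_typ + tau2_within + tau2_between
    return {
        "level1_sampling": 100.0 * v_typ / total,
        "level2_within_dataset": 100.0 * tau2_within / total,
        "level3_between_dataset": 100.0 * tau2_between / total,
    }


# Hypothetical inputs: four effect sizes with equal sampling variance,
# no within-dataset heterogeneity, and some between-dataset heterogeneity.
shares = variance_shares([0.01, 0.01, 0.01, 0.01],
                         tau2_within=0.0, tau2_between=0.01)
print(shares)
```

With equal sampling variances the typical variance reduces to that common value, so this toy input splits the total evenly between level 1 and level 3.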

Study Identification and Selection
The PRISMA diagram (Figure 1) illustrates the exhaustive search and selection methodology adopted in the present investigation. Initially, a comprehensive search was conducted across various databases from inception to 20 March 2024, yielding 176 studies, comprising 36 from PubMed, 72 from EMBASE, and 68 from Web of Science. After 66 duplicates were removed, 110 articles were further assessed using EndNote software. An initial review of titles and abstracts led to the exclusion of 36 articles, attributed to their irrelevance or lack of comprehensive detail. Further evaluation of the 74 full-text articles resulted in the exclusion of 57 articles [8,  for various reasons, including the nature of the content being reviews, supplements, or conference abstracts; the absence of MRI application; retraction status; irrelevance to the scope of the current meta-analysis; or the inadequacy of reported outcomes for quantitative synthesis (refer to Table S4). This selection process culminated in the selection of 17 studies [11,12,83-97] for detailed examination within the scope of this analysis.
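The selection counts reported above can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are those stated in the text.

```python
# Records identified per database, as reported in the text.
records = {"PubMed": 36, "EMBASE": 72, "Web of Science": 68}

identified = sum(records.values())       # records identified
after_dedup = identified - 66            # after removing duplicates
full_text = after_dedup - 36             # after title/abstract screening
included = full_text - 57                # after full-text exclusions

print(identified, after_dedup, full_text, included)  # 176 110 74 17
```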

Basic Characteristics of Included Studies
The seventeen investigations [83]-[97] implemented a retrospective approach, encompassing a cumulative patient population of 7830 individuals. The sizes of the cohorts exhibited significant variability, ranging from a minimum of 29 [11] to a maximum of 4100 [95] patients. A fundamental aspect of these investigations was the implementation of manual annotation, underscoring the indispensable role of human expertise in the research framework. The methodologies for validation adopted across these studies were bifurcated into either a train/test split [83-89,95,96] or cross-validation [ 94,97], with the criteria for annotation differing and encompassing evaluations by professionals such as experienced clinicians, radiologists, radiation oncologists, and oncologists (Table 1).

Quality Assessment
Figure S1 illustrates the quality assessments of the included studies conducted with the QUADAS-2 tool. Supplementary Table S6 details an analysis focusing on bias-related risks and applicability concerns, identifying ambiguous risks due to the exclusion of interval derivation in datasets in 10 (58.8%) of the studies [12,83-89,93,97], which may impact data interpretation. This criterion could influence the applicability and generalizability of the results from these studies.

Efficacy of DL Model Segmentation of NPC on MRI
The investigation synthesized findings from 11 studies, each utilizing distinct datasets and DL models for segmentation tasks, and uncovered notable variations in Dice scores, which spanned from 66% to 84%. The consolidated outcomes produced a pooled Dice score of 78%, with a 95% confidence interval (CI) ranging from 74% to 83% (Figure 2). The Q test indicated substantial heterogeneity across the studies, as evidenced by a Q value of 588.81 with a significance level below 0.01. Further affirmation of this heterogeneity was provided by the Higgins I² statistic, which reported a remarkably high degree of variability (I² = 99.02%). Sensitivity analysis reinforced the reliability of these findings, affirming the statistical significance of the summary effect sizes even upon the sequential exclusion of individual studies from the analytical framework (Figure S2). Additionally, the funnel plot assessment of the 11 studies, coupled with Egger's regression test, disclosed a p-value of 0.037, suggesting the presence of publication bias within the examined corpus of studies (Figure S3). Nevertheless, subsequent subgroup analysis based on publication metrics failed to disclose any significant discrepancies (Figure S4).
A three-level meta-analysis was then undertaken to examine potential moderating factors associated with DL models utilized in segmentation tasks. This examination included an extensive assessment of outcomes across numerous validation sets, augmented by clustering according to datasets to mitigate the impact of their repeated utilization. From an aggregation of 68 reported effects spanning 17 distinct studies, the mean Dice coefficient was calculated to be 76.4%, with a 95% CI ranging from 71.1% to 81.6%. The Q statistic analysis revealed an absence of significant heterogeneity, evidenced by a Q value of 55.4 (p = 0.821). Comparative evaluations employing Akaike and Bayesian information criteria favored the three-level model over conventional two-level approaches, indicating that it represents the data structure more accurately. Further, variance analysis showed that 58.61% of the total variance was attributable to level 1 (sampling variance), with the residual variance split between within-dataset disparities (4.6e-8%) at level 2 and inter-dataset differences (41.39%) at level 3. This distribution of variability underscored significant inter-dataset variation, in contrast to negligible within-dataset discrepancies (Supplementary Table S5), reinforcing previously observed significant heterogeneity in meta-analyses of independent datasets. Meta-regression analysis probing factors such as dataset splitting, validation methodology, MRI sequence, algorithmic typology, training volume, and preprocessing approaches did not yield significant correlations with the segmentation efficacy of DL models.
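The Egger's regression test used for the independent datasets can be sketched as an ordinary least-squares fit of each study's standardized effect (effect divided by its standard error) on its precision (one over the standard error); an intercept far from zero suggests funnel plot asymmetry. The sketch below is an assumption-laden illustration: the function name `egger_regression` and the toy data are hypothetical, and it returns only the intercept's t statistic, which would then be compared against a t distribution with k − 2 degrees of freedom to obtain a p-value.

```python
import math
from typing import List, Tuple


def egger_regression(effects: List[float],
                     std_errors: List[float]) -> Tuple[float, float, float]:
    """Return (intercept, slope, t statistic of the intercept)."""
    k = len(effects)
    if k < 3:
        raise ValueError("need at least three studies")
    x = [1.0 / s for s in std_errors]                   # precision
    y = [e / s for e, s in zip(effects, std_errors)]    # standardized effect
    x_mean, y_mean = sum(x) / k, sum(y) / k
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must not all be equal")
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual variance and the standard error of the intercept.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (k - 2)
    se_intercept = math.sqrt(sigma2 * (1.0 / k + x_mean ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else 0.0
    return intercept, slope, t_stat


# Hypothetical data in which smaller (less precise) studies report higher Dice.
toy_effects = [0.90, 0.85, 0.80, 0.78, 0.77]
toy_se = [0.05, 0.04, 0.02, 0.015, 0.01]
intercept, slope, t_stat = egger_regression(toy_effects, toy_se)
print(intercept, slope, t_stat)
```

When every study reports the same effect regardless of its precision, the standardized effects fall on a line through the origin and the intercept is zero, which is the no-asymmetry case.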

Discussion
The primary objective of this systematic review and meta-analysis was to assess the efficacy and accuracy of DL models, specifically in the segmentation of NPC in MRI. In medical imaging, especially for conditions like NPC where precision in diagnosis and treatment planning is critical, DL technologies have transformative potential. By focusing on MRI, this review targets an area where DL models can leverage high-resolution images for better disease characterization.

Summary of Findings
Our comprehensive analysis revealed that DL models, particularly convolutional neural networks (CNNs), enhance the accuracy of NPC segmentation in MRI scans. The pooled analysis of Dice scores, a key metric for evaluating segmentation accuracy, included 11 studies with a total of 7830 patients or MRI scans. Using a random effects model, we calculated a pooled mean Dice score of 78% (95% confidence interval: 74% to 83%) across the included studies (Figure 2). The Dice score ranges from 0 to 1, with higher values indicating better segmentation accuracy. Heterogeneity among the studies was assessed using the Q test and the I² statistic. The Q test indicated substantial heterogeneity across the studies (Q = 588.81, p < 0.01), and the I² statistic revealed a high degree of variability (I² = 99.02%). To explore the potential sources of heterogeneity, we conducted subgroup analyses and meta-regressions on variables such as publication status, MRI sequence, algorithm type, and preprocessing techniques (see Section 3.6 for details). The funnel plot assessment and Egger's regression test (p = 0.037) suggested the presence of publication bias within the examined studies (Figure S3). However, further subgroup analysis based on publication status did not reveal any significant discrepancies (Figure S4). These findings underscore the effectiveness of DL models in improving NPC segmentation accuracy in MRI scans compared to traditional methods. The pooled mean Dice score of 78% indicates a moderate to high level of segmentation accuracy, highlighting the potential of DL models to enhance clinical decision making and treatment planning in NPC management. However, it is important to acknowledge the substantial heterogeneity observed among the included studies, which may stem from differences in patient populations, MRI acquisition protocols, and DL model architectures.

Comparison with the Existing Literature
Previous reviews have extensively covered various applications of deep learning and machine learning for nasopharyngeal carcinoma (NPC) [98-100]. In the review by Li et al. [98], the authors briefly outline articles related to auto-segmentation using deep learning techniques. Ng et al. [99] presented a descriptive box plot in their study of autotargeting, showing a median Dice score of 0.7530, which illustrates the current performance level in this field. Wang et al. [100] discussed the advantages and disadvantages of different imaging modalities. They noted that while CT images often lack sufficient soft tissue contrast, PET images provide excellent tumor visualization but fail to deliver accurate boundary information due to their low spatial resolution. Dual-modality PET-CT images, however, offer more valuable information for delineating tumor boundaries and assessing the extent of tumor invasion [101]. Owing to its superior soft tissue contrast, MRI is considered the gold standard for staging and measuring target volume contours in NPC. Even so, identifying tumor margins on MRI can be challenging due to factors such as high variability, low contrast, and discontinuous soft tissue margins. While discussions on auto-segmentation using deep learning methods are present, there is a notable lack of comprehensive and quantitative analysis in the existing literature.
Compared to previous systematic reviews and meta-analyses on CT and MRI segmentation of nasopharyngeal cancer, our focused investigation into NPC segmentation exclusively using MRI technology represents a more specialized inquiry into this domain [14].Our review not only corroborates the effectiveness of deep learning models in NPC segmentation, demonstrating a pooled Dice score of 78%, closely aligning with prior findings of 76% [14], but it also introduces several key differentiators that enhance the robustness and relevance of our conclusions.Notably, our review incorporates five additional studies from 2023 and 2024, broadening the evidence base.Our emphasis on MRI scans allowed for more nuanced data extraction and analysis, ensuring a deeper understanding of this specific imaging modality's challenges and opportunities in NPC segmentation.Furthermore, we employed a two-pronged meta-analysis approach: a traditional two-level random effects model that addressed independent datasets and a novel three-level random effects model that accounted for all reported results across validation sets, effectively clustering by dataset.This methodology revealed significant heterogeneity among independent datasets, indicating the necessity for further research to explore the sources of this variability.Future studies are encouraged to expand the dataset to illuminate these findings further and comprehensively address the identified heterogeneity.

Strengths of Deep Learning Models
DL models handle complex, high-dimensional data and are well suited to medical imaging tasks. Their strengths lie in rapid processing, high accuracy, and reproducibility, as demonstrated by models like nnU-Net [86] and CDDSA [87], which exhibited exemplary performance in our review. The nnU-Net (no-new-Net) [102] represents a significant stride in the application of deep learning for medical image segmentation, specifically highlighted in our review by its exceptional performance in NPC segmentation within MRI scans. Achieving a Dice score of 0.88, the nnU-Net not only demonstrates its robustness in precisely delineating the tumor boundaries in NPC but also underscores the model's capability in handling the inherent complexities of medical imaging data. This performance is particularly noteworthy given the challenging nature of NPC, a cancer type characterized by its intricate anatomical location and the potential for subtle imaging signatures.
nnU-Net's architecture is designed to automatically adapt to the segmentation task's specifics, including optimizing its configuration to match the input data dimensions, preprocessing routines, and network architecture parameters. This adaptability is key to its success, enabling the nnU-Net to efficiently process the high-dimensional data typical of MRI scans, thereby ensuring high accuracy and reproducibility across different datasets and segmentation tasks. The model's proficiency in capturing the nuanced details of NPC tumors from MRI without the need for extensive manual tuning or intervention represents a paradigm shift from traditional segmentation approaches, which are often time-consuming and prone to inter- and intra-observer variability. By automating the segmentation process while maintaining, if not exceeding, the accuracy of manual methods, the nnU-Net not only enhances diagnostic workflows but also paves the way for more personalized and timely treatment planning, leveraging the full potential of deep learning to improve patient care outcomes in oncology.
Comparing the three models, the nnU-Net [86], CDDSA [87], and CNN [11], the studies using CDDSA and CNN demonstrated higher performance than the one using the nnU-Net. All three studies utilized extensive preprocessing techniques such as intensity normalization, image augmentation, and image cropping. The study using the nnU-Net [86] additionally employed resolution adjustment. It is important to note that the CNN study [11] from 2018 had a limited sample size of only 29 patients, which may affect the robustness and generalizability of their model's performance. In contrast, the study using the nnU-Net [86] included 1057 patients and performed external validation, demonstrating the most robust validation among the three. The CDDSA study [87] used 189 patients with internal validation, which can be considered decent. Moreover, the disentangle-based style augmentation technique utilized in the CDDSA study may have contributed to its high performance.

Limitations and Challenges
Despite the promising outcomes, our review faced limitations, including evident publication bias and significant study heterogeneity, which could influence the interpretability of our results. Moreover, while being the standard for comparison, the manual segmentation process introduces subjectivity and variability in outcomes. DL models, though superior, are not without challenges, including the need for extensive training data and the complexity of model tuning to achieve optimal performance.

Implications for Clinical Practice
Integrating DL models into clinical settings for NPC segmentation from MRI scans could revolutionize treatment planning and prognosis evaluation. The precision of DL-enhanced segmentation can lead to more accurate staging, targeted therapy, and monitoring strategies. However, for such integration to be successful and globally applicable, there is a critical need for standardization in DL model development, validation, and implementation across different healthcare contexts.

Future Research Directions
Future research should aim at developing more advanced DL models capable of accommodating the variability inherent in MRI data, including differences in imaging parameters and tumor presentation. Moreover, exploring DL applications beyond segmentation, such as in treatment response assessment and recurrence detection in NPC, could provide comprehensive tools for holistic disease management. This direction promises improvements in clinical outcomes and paves the way for personalized treatment approaches based on predictive analytics.

Conclusions
Our systematic review and meta-analysis have highlighted the effectiveness of deep learning (DL) models in improving the accuracy of nasopharyngeal carcinoma (NPC) segmentation in MRI scans, with a pooled mean Dice score of 78% (95% confidence interval: 74% to 83%), indicating a moderate to high segmentation accuracy in DL models. DL's role in medical imaging, particularly for NPC, marks a significant advancement that matches the growing need for precision in medical diagnostics. However, the substantial heterogeneity and the presence of publication bias observed necessitate a careful interpretation of these results. They emphasize the need for further validation and standardization of DL models across varied clinical environments to confirm their effectiveness and consistency. While current deep learning models achieve moderate to high segmentation accuracy, further optimization and improvement of deep learning architectures are warranted. As we look forward, integrating DL into clinical practice is set to transform NPC management by equipping clinicians with more accurate tools, potentially enhancing personalized treatment and patient outcomes. Future research should extend the use of DL to other areas, such as treatment response monitoring and intraoperative imaging, maximizing the benefits of this technology in cancer care.

Supplementary Materials:
The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/bioengineering11050504/s1, Figure S1: The results of QUADAS-2 quality assessment for the included studies. Figure S2: The results of a sensitivity analysis of deep learning algorithms in independent datasets using the one-study removal method. Figure S3: Funnel plot of Dice scores for deep learning algorithms in independent datasets. Figure S4: Forest plot of subgroup analysis deep learning algorithms in an independent dataset using publication status as moderator. Table S1: PRISMA-DTA Abstract Checklist. Table S2: PRISMA-DTA Checklist. Table S3: Keywords and search results in different databases. Table S4: Excluded articles and reasons. Table S5: Comparison of multilevel meta-analysis model clusters with datasets of segmentation Dice scores across all validation sets. Table S6: Quality assessment according to the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) criteria. Table S7: The Checklist for Artificial Intelligence in Medical Imaging scores.
Institutional Review Board Statement: This meta-analysis did not intervene or interact with humans or collect identifiable private information and thus does not require institutional review board approval.
Informed Consent Statement: Not applicable.

Table 2. Characteristics of MRI.

Table 3. Characteristics and performance of preprocessing techniques and deep learning algorithms.